Adversarial Attack

A comprehensive survey can be found here.

Terminology:

  • black-box/white-box attack: whether the adversarial example is generated without or with knowledge of the target model (architecture, parameters, gradients).

  • targeted/non-targeted attack: whether the attack aims to make the model predict a specific target label for the adversarial example, or merely any incorrect label.

  • universal perturbation: a single perturbation that fools a given model on any image with high probability.

Attack

  1. Backward Update

    • add an imperceptible perturbation to the input that increases the classification loss, typically by following the loss gradient with respect to the input (see the first sketch after this list)

    • universal adversarial perturbation: learn a single residual perturbation that fools the model on most clean images (see the second sketch after this list)

  2. Forward Update
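
A minimal PyTorch sketch of a gradient-based ("backward update") attack in the spirit of FGSM. The names `model`, `x` (an input batch scaled to [0, 1]), and `y` (integer labels) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step gradient attack: perturb x along the sign of the input
    gradient so that the classification loss increases."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then keep pixels valid.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

And a simplified sketch of learning a universal perturbation by gradient ascent on the average loss over a data loader (not the original DeepFool-based algorithm; `loader`, `step`, and `epochs` are illustrative assumptions):

```python
def universal_perturbation(model, loader, epsilon=0.05, step=0.01, epochs=5):
    """Learn one perturbation shared by all images that raises the loss
    on most clean inputs (simplified gradient-ascent variant)."""
    delta = None
    for _ in range(epochs):
        for x, y in loader:
            if delta is None:
                # Single perturbation, broadcast over the batch dimension.
                delta = torch.zeros_like(x[0], requires_grad=True)
            loss = F.cross_entropy(model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += step * grad.sign()      # ascend the average loss
                delta.clamp_(-epsilon, epsilon)  # keep it imperceptible
    return delta.detach()
```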

Defense

  1. Use modified training samples during training (e.g., adversarial training, sketched after this list) or modified test samples during testing (e.g., input denoising or transformations)

  2. Modify the network: regularize model parameters or add a defensive layer/module

  3. Adversarial example detector: classify an example as adversarial or clean based on statistics of the input or of the network's intermediate activations
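
A minimal sketch of defense 1 via adversarial training, reusing the hypothetical `fgsm_attack` helper from the attack section (an `optimizer` over `model.parameters()` is assumed):

```python
def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One training step on adversarially perturbed samples (defense 1)."""
    x_adv = fgsm_attack(model, x, y, epsilon)  # craft perturbed training inputs
    optimizer.zero_grad()                      # discard gradients left by the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, stronger multi-step attacks such as PGD are usually preferred over single-step FGSM for crafting the training perturbations.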

New perspective

Adversarial examples are not bugs, they are features (Ilyas et al., 2019): models rely on non-robust features that are genuinely predictive yet brittle, and adversarial perturbations exploit exactly those features.